Remote sensing inversion on heavy metal content in salinized soil of Yellow River Delta based on Random Forest Regression—a case study of Gudao Town

To explore the potential of using the mineral alteration information extracted by remote sensing technology to indirectly estimate the heavy metal content of salinized soil, 23 sampling points were uniformly set up in the town of Gudao in the Yellow River Delta as the research area in 2022. The concentrations of seven heavy metals, Cr, Cu, Pb, Zn, As, Mn and Ni, at the sampling points were determined in laboratory tests. Spectral derivative indices, topographic factors, and mineral alteration information (iron staining, hydroxyl, and carbonate ions) were extracted and screened as modeling factors using Sentinel 2 imagery. An inverse model of heavy metal content was constructed using the random forest algorithm, and the model accuracy was evaluated using the cross-validation method. The results of the study show that: (1) Hydroxyl and carbonate ion alteration can be effectively used for the inversion of soil As and Ni content in this study area. Iron-stained alteration can be used as a modeling factor in the inversion of Cr, Cu, Pb, Zn, and Mn concentrations. (2) The inclusion of alteration information improves the accuracy of heavy metal content inversion. The Cu concentration was verified to be the best predictor, with an RMSE of 3.309, MAPE of 11.072%, and R2 of 0.904, followed by As, Ni, and Zn; the predictive value of Mn, Cr and Pb was average. (3) Based on the results of concentration inversion, the high concentration areas of As, Ni, and Mn are primarily distributed on both sides of the river and around lakes and ponds. The high-concentration areas of Zn were mainly distributed in the farmland areas on both sides of the river. Areas with high concentrations of Cu were mainly distributed in the eastern oil extraction area, both sides of the rivers, and around lakes.


Data acquisition and preprocessing
1. Soil sample data Soil samples were collected in the field on July 25-27, 2022 in Gudao, and were mainly distributed in the eastern part of the study area in the oil extraction area, in the central part of the agricultural area, and to a lesser extent, on the sides of riverbanks and pits, at 23 sampling sites.Surface weeds and stones were removed by digging to a depth of 0-30 cm using a shovel.Soil samples were collected in polyethylene plastic bags weighing approximately 2.5 kg per sampling point.Using a Global Navigation Satellite System (GNSS) receiver to receive signals from a continuously operating reference station (CORS), and using Real-Time Kinematic (RTK) method to locate the sampling point, the coordinate system is WGS-84.Sealed bags are labeled with the time of sampling, latitude and longitude, and sample number.Transportation in black plastic bags protected from light at the end of sampling.Afterwards, the collected samples are sent back to a professional testing organization for testing.First, the samples were removed from impurities such as stones and plant fragments.Then, soil samples were dried after natural air drying, grinded with a glass rod and passed through a 100 mesh nylon aperture sieve, and placed in clean sample bags.A German-made wavelength dispersive X-ray fluorescence spectrometer (Bruker S8 TIGER) was utilized for the detection of heavy metal elements in soil samples.The metal element concentrations were detected using wavelength-dispersive X-ray fluorescence spectrometry (HJ780-2015).The detection limits were 3.0 mg/kg for Cr, 10.0 mg/kg for Mn, 1.5 mg/kg for Ni, 1.2 mg/kg for Cu, and 2.0 mg/kg for Zn, As, and Pb.
The numerical characteristics of the seven metal elements were determined based on the results of the laboratory tests.The extreme values, arithmetic means, and standard deviations of the metal contents were selected as statistical indicators and compared with the background values of soil chemistry in Dongying City.The calculated comparisons showed that the percentages of samples with Cu, Pb, Zn, Cr, Mn, As, and Ni contents exceeding the soil chemical background values in Dongying were 0%, 10%, 2.5%, 5%, 0%, 7.5%, and 10%, respectively.Based on the distribution of metallic elements in the study area, a histogram rectangle with group distance as the bottom edge and frequency as the height was plotted to show the content of each metallic element at the 23 sampling points and the exact distribution of the elements (Fig. 2).remote sensing imagery acquired is at the L2A level and have been processed for radiometric calibration and atmospheric correction, and only cropping was required before image processing and analysis.Water disturbance information was extracted by MNDWI, and band3 and band11 were selected to participate in the calculation.

Remote sensing image data
A segmentation threshold of − 0.12 < WNDWI < − 1 was used to create a water mask file, and the image was masked using the mask file.

Theory and methodology
Modeling factors

Spectral derivative indices
Various vegetation covered on the ground absorbs some heavy metal elements from the soil during its growth process, as well as soil moisture, brightness, and color, can affect satellite sensors to a certain extent when they receive reflected waves from the ground.Currently, the use of vegetation indices to invert land-cover change is gradually becoming an important research tool for understanding global environmental change 26,27 .Therefore, in this study, a variety of spectral indices reflecting the surface conditions, also known as spectral derivative indices, were selected to enhance specific information in remote sensing images.
The spectral derivative indices selected for the study were the Normalized Difference Vegetation Index (NDVI), Clay Mineral Ratio (CMR), Modified Normalized Difference Water Index (MNDWI), Ratio Vegetation Index (RVI), Difference Vegetation Index (DVI), modified soil-adjusted vegetation index (MSAVI), Enhanced Vegetation Index (EVI), and Red-Green Ratio Index (RGRI), as shown in Table 1.

Topographic factors
The ASTER GDEM Digital Elevation Model (DEM), which covers the entire study area at a resolution of 30 m, was downloaded from the Geospatial Data Cloud website.The ASTER GDEM is an easily accessible, high-precision DEM that covers virtually all of the Earth's landmass and is available to all users, regardless of the size or location of their target area.The DEM data were further used for terrain factor extraction, including aspect and slope.

Alteration information
The alteration of surrounding rocks refers to the process by which rocks and minerals undergo changes to their original physical and chemical properties due to hydrothermal activity, leading to the formation of new mineral assemblages 33 .Iron oxides and clay minerals have a certain accompanying relationship with heavy metals in soil, and information on iron-stained alteration, -OH, and CO 3 2− alteration may reflect the distribution of the two types of minerals.Principal component analysis (PCA) is one of the most commonly used methods for the extraction of alteration information.This is the process of reducing the correlation between images using a linear transformation to represent the information in each band with the fewest comprehensive indicators, thus highlighting the useful information contained in the image.By performing PCA on the water masked image, www.nature.com/scientificreports/ the information is mainly concentrated in the first component so that the magnitude of the contribution of the generated principal components can be reflected in the magnitude of the eigenvector matrix weights 34,35 .
After specifically analyzing the actual situation of the study area, we chose to extract iron stained, -OH, and CO 3 2− alteration information.
In 1978, Hunt et al. found that minerals containing ions or ionophores such as Fe 2+ , Fe 3+ , -OH, and CO 3 2− have unique absorption properties in the visible-near-red band by examining the reflectance spectral characteristics of minerals 36 .With the help of rock and mineral spectral curves obtained from the U.S. Geological Survey (USGS) standard spectral library, the characteristic absorption peaks of the iron-stained minerals occur at 0.40-0.50μm, 0.80-0.92μm, 1.39-1.41μm, and 1.90-1.92μm, and the characteristic reflectance peaks are mainly distributed at 0.65-0.72 μm, 1.23-1.30μm, and 1.60-1.70μm.The characteristic absorption bands of -OH and CO 3 2− ionophores were generally distributed at 0.86-0.98μm, 1.32-1.44μm, and 2.15-2.55μm, in which the most significant absorption feature was located at the vicinity of 2.33 μm, and the characteristic reflectance bands were mainly distributed at 0.49-0.58μm, 1.60-1.70μm, and 2.00-2.14μm.
Based on the spectral characteristics of iron-stained minerals, after multiple band selections, two absorption bands at 0.40-0.50μm and 0.80-0.92μm, as well as two reflection bands at 0.65-0.72 μm and 1.60-1.70μm were selected.These correspond to bands 2, 8, 4, and 11 of Sentinel-2A imagery, respectively.PCA (Table 2) was performed to extract information related to the alterations in iron staining.Based on the characteristics of the eigenvectors of the anomalous principal components, the absolute values of the loading coefficients for bands 2 and 4 must be relatively large and have opposite signs: one negative and one positive.Additionally, the loading coefficients for bands 8 and 11 should have opposite signs: one negative and one positive.By analyzing the PCA eigenvector matrix of the Sentinel-2A images in the study area, the loading coefficients of bands 2, 8, 4, and 11 in the principal component of PC4 were found to be consistent with the requirements.Thus, we conclude that the iron-stained alteration information was located in the PC4 component.
Based on the spectral characteristics of -OH and CO 3 2− minerals, two absorption bands of 0.86-0.98μm and 2.15-2.55μm, and two reflection bands of 0.49-0.58μm and 1.60-1.70μm, corresponding to bands 8, 12, 2, and 11 of the Sentinel-2A image, were finally selected by several band selections and were subjected to PCA (Table 3) to extract the -OH and CO 3 2− alteration information.Depending on the characteristics of the eigenvectors of the anomalous principal components, the signs of the loading coefficients of band 2 and band 8 should be opposite: one positive and one negative, and the signs of the loading coefficients of band 11 and band 12 should also be oppositive: one positive and one negative.The results of the analysis presented in Table 3 show that the loading coefficients of bands 8, 12, 2, and 11 in the PC4 principal component of the Sentinel-2A image in the study area meet the requirements.Therefore, the -OH and CO 3 2− alteration information of the Sentinel-2A image of Gudao Town was also determined to be located in the PC4 component.

Complex covariance test
A strong correlation between the factors distorts the model estimation.Therefore, the above factors must be tested for complex covariance to prevent the existence of serious complex covariance among them.
The Variance inflation test (VIF) was used to test the linear correlation between the factors.The formula is as follows: The test statistic used to examine the explanatory ability of the model is R 2 (sample decidability coefficient).The magnitude of R 2 determines the degree of correlation between factors.A smaller R 2 value indicates that this (1)

Random Forest Regression model (RFR model)
Random Forest Regression (RFR) is the combination of multiple CART decision trees as weak learners to form a "forest" to make predictions on a dataset 40 .A single decision tree can only take a single classifier when making decisions, which inevitably results in disadvantages, such as the decision tree being able to select only one type of attribute to analyze at a time or overfitting the model to accommodate noise.To address these shortcomings, the algorithm builds a forest using randomization, in which the decision trees in the forest are not associated with each other.Multiple decision trees constructed to average errors and improve prediction accuracy are the core concepts of random forests.
To determine the accuracy of the model, three accuracy indicators used in this study were root mean square error (RMSE), mean absolute percentage error (MAPE), and R 2 .
RMSE is the square root of the ratio of the square of the deviation of the predicted value from the true value to the number of observations, which is of the same order of magnitude as the true value; it measures the deviation of the predicted value from the true value and is more sensitive to outliers in the data.The formula is as follows: MAPE is the average of the deviation of the predicted value from the true value and the absolute value of the true value, and is a relative metric that is commonly used as a statistical measure of prediction accuracy.The formula is as follows: The R 2 statistic is an important measure of the goodness-of-fit of the model and is calculated as the ratio of the regression sum of squares to the total sum of squares.The numerical values reflect the relative contributions of the regression.The formula is as follows: The range for RMSE and MAPE values is [0, + ∞); when the gap between the predicted and true values is smaller, the closer the value is to 0, the higher the model accuracy; therefore, the smaller the RMSE and MAPE values of the model, the higher the model efficacy.Compared to RMSE, MAPE provides a more realistic reflection of the deviation between the predicted and actual values, presented as a percentage to enhance user understanding.However, the calculation is limited when the actual values are close to or equal to zero.By contrast, RMSE is applicable to any dataset and is more sensitive to outliers in the data.The R 2 value ranges from 0 to 1, with values closer to 1 indicating a higher model accuracy and stronger predictive capability.

Modeling factor extraction and screening
The correlations between the contents of the seven heavy metals were analyzed to obtain a comparable matrix of correlation coefficients, indicating a correlation between the variables (Fig. 3). Figure 3 shows that the correlations of As, Mn, Ni, Cu, and Cr were better.
Spectral data, including six spectral bands and eight derived spectral indices, were extracted from the Sentinel-2A image of Gudao and analyzed for correlations between the concentration of each heavy metal and the spectral data (Table 4).As shown in Table 5, the correlation between the elemental concentrations of Mn, Ni, As, Cu, and Zn and the six spectral bands was better, whereas the correlation between Pb and Cr and the six spectral bands was worse.Regarding spectral indices, the elemental concentration of Mn correlated well with MSAVI, MNDWI, RGRI, and NDVI, and the concentration of Ni correlated well only with MNDWI; all eight spectral indices correlated well with As, and the elemental concentrations of Cu and Zn correlated well with MSAVI, MNDWI, and RGRI.
The topographic factors included DEM, aspect and slope, and the correlation between the concentration of each heavy metal element and the topographic factors was calculated separately (Table 5).The coefficients in Table 5 show that the concentrations of six metal elements (Mn, Ni, As, Cu, Zn, and Cr) had the best correlation with aspect, that is they had the greatest effect on the metal elements, and the concentration of Pb had the best correlation with slope.However, the correlation coefficients of each terrain factor with metals were relatively low, and other factors must be considered in the subsequent selection of the modeling factors to select the most appropriate terrain factor.
Sentinel-2A images of the study area were processed using PCA to extract the PC4 component that could reflect the alteration information and analyze the correlation between the concentration of each heavy metal element and the alteration information (Table 6).From the data in Table 7, it can be seen that, overall, the concentrations of five metal elements, Mn, Ni, Cu, Zn, and Cr, were well correlated with iron-stained alteration, and As Vol:.( 1234567890   and Pb had good correlations with -OH and CO 3 2− alteration.However, the correlation coefficients between the concentration of each metal element and the alteration information did not differ significantly from each other.
According to the results of the above correlation analysis, modeling factors with higher correlation coefficients were screened for the complex covariance test by comprehensively considering the spectral information, topographic factors, and mineral alteration information.Factors with VIF values greater than 10 were discarded, and the final modeling factors that were retained are listed in Table 7.

Inverse modeling of soil heavy metal content
The original data is first randomly divided into training and validation sets in the ratio of 7:3.For the modeling factors selected for each heavy metal element in the study area, the RFR algorithm was used to construct an inverse model of the soil heavy metal concentrations in the region.The inversion results are listed in Table 8.As shown in Table 8, the prediction accuracy of the concentration of each element was improved by adding alteration information.The elemental concentration of Cu was predicted to be the best with an RMSE of 3.309, MAPE of 11.072%, and R 2 of 0.904.This was followed by As, Ni, and Zn, all with R 2 greater than 0.5 and significantly lower RMSE and MAPE.Although the prediction of the elemental concentrations of Pb, Cr, and Mn was improved by adding the alteration information, the prediction accuracy remained average.The analysis revealed that the low prediction accuracy of Cr and Pb may be due to the presence of a high number of outliers in the prediction model.
The corresponding scatter plots were obtained by comparing the measured values with the predicted concentrations of each heavy metal in the study area (Fig. 4).Statistical analysis of the predicted and measured values revealed that the inclusion of alteration information improved the accuracy of the model, and the prediction results were more stable.The scatterplot of the total sample shows that the closer it is to the y = x line, the better the model fit.The sample-predicted and measured values of the RFR model with the addition of alteration information were mostly concentrated around the 1:1 line, and the model fitting was superior to that predicted by the model without the addition of alteration information.

Spatial distribution of soil heavy metal content
The spatial distribution of heavy metal concentrations was estimated using the constructed RF model based on spectral derivative indices, topographic factors, and alteration information after screening for the seven heavy metal elements in the study area.The inversion results are shown in Fig. 5.The analysis results showed that the spatial distribution of the elemental concentrations of As, Ni, and Mn was generally consistent, with the highconcentration areas mainly distributed along the banks of rivers, lakes, and pits, followed by higher concentrations in the eastern oil extraction areas and lower concentrations in the agricultural areas.The Yellow River carries a large amount of sediment downstream from its source, and some of the sediments form piles along the banks.High concentrations on both sides of the river and around the lake may be due to compound industrial   are similar to the spatial distribution maps of the metals obtained by Yu et al. 41 .using the kriging interpolation method.The feasibility of using mineral alteration information extracted based on remote sensing images for predicting heavy metal content is again demonstrated.Moreover, the spatial distribution maps of heavy metal element concentrations obtained based on this paper's method are more detailed than those obtained based on the Kriging interpolation method.In terms of the drivers of heavy metal aggregation, the causes of Cu, Pb, and Zn aggregation analyzed in this paper have similarities with the sources of Cu, Pb, and Zn aggregation analyzed in the study by Zhang et al. 42 .It provides a more scientific basis for further monitoring the heavy metal content of soil on a large scale and strengthening regional land quality management.

Conclusions
This study used the town of Gudao in the Yellow River Delta as the study area and obtained seven heavy metal concentrations from 23 sampling points through on-site sampling and analysis.Based on Sentinel-2A images, spectral derivative indices, terrain factors, and mineral alteration information were extracted and filtered.A soil heavy metal concentration inversion model was constructed using the RFR algorithm, and the accuracy of the model was evaluated to achieve the spatial distribution inversion of heavy metal concentrations within the region.The results of this study are as follows: 1. -OH and CO 3 2− alterations can be effectively used for the inverse modeling of soil As and Ni contents.The RMSE and MAPE metrics of the prediction model were significantly reduced, and the R 2 improved from 0.2 to more than 0.5.Improvement of prediction by applying iron-stained alterations to Cr, Cu, Pb, Zn, and Mn elemental concentration inversion.In particular, the prediction accuracy of Cu elemental concentration was significantly improved with an RMSE of 3.309, MAPE of 11.072%, and R2 of 0.904, followed by elemental Zn with an RMSE of 5.065, MAPE of 8.01%, and R2 of 0.523, whereas the accuracy of Mn, Cr, and Pb improved and remained at a low level.However, for the entire sample, the correlation between the measured and predicted values of Mn was better, and the low prediction accuracy of Cr and Pb may have been due to the existence of points with higher concentrations.2. The spatial distribution of soil heavy metal concentrations in the study area was inverted based on the constructed model.It was found that the areas of high concentration of As, Ni and Mn elements were mainly located on the banks of rivers, around lakes and ponds, followed by higher concentrations in the eastern oil extraction areas and lower concentrations in the agricultural areas.The high-concentration area of Cu is consistent with the distribution of the high-concentration areas of the above three elements, mainly in the eastern oil extraction area, both sides of the river, and around the lake.Areas with high Zn concentrations were mainly located in agricultural areas on both sides of the river.High concentrations of Cr and Pb are distributed over a wide area, with lower concentrations found only in some areas of the pits.The results of this study provide a scientific basis for the assessment and management of soil quality.For the perimeters of points with abnormally high heavy metal concentrations, further research is required in terms of field investigation, data analysis, and model construction.The number of sampling points can be increased and the range can be further expanded so that more data can be used for modeling and the applicability of the model can be improved.Meanwhile, at the level of spectral response mechanism, the correlation study between heavy metal elements and alteration information needs to be further deepened.

Figure 1 .
Figure 1.Geographic location of the research area.

Figure 4 .
Figure 4. Comparison of the accuracy of inversion models for heavy metal concentrations.

Figure 5 .
Figure 5. Spatial distribution of heavy metal element concentrations from Random Forest Modeling.
, three types of spatial resolution(10, 20, and 60 m) and a width of 290 km.Sentinel-2A imagery combines the strengths of the Landsat and Spot satellites with high-resolution multispectral features and relatively stable quality.Therefore, in this study, Sentinel-2A data on October 23, 2022 were selected for an inversion study of soil heavy metal content in the Yellow River Delta region.Images covering the entire study area were downloaded from the official website of the European Space Agency.The downloaded Sentinel-2A remote sensing images were imported into SNAP and exported in ENVI format after 10 m resampling.Band fusion is performed on the exported files in the ENVI platform to obtain remote sensing images displayed in real bands.As the Sentinel-2A Vol.:(0123456789) Scientific Reports | (2024) 14:11216 | https://doi.org/10.1038/s41598-024-62087-ywww.nature.com/scientificreports/bands

Table 3 .
PCAfactor is less linearly correlated with the other factors.If the VIF value is greater than 5, it indicates the presence of multicollinearity; if the VIF value is greater than 10, it indicates the presence of severe collinearity and the factor needs to be discarded.

Table 4 .
Correlation coefficients between heavy metal content and spectral data.*Significant correlation; **extremely significant correlation.

Table 5 .
Correlation coefficients between heavy metal concentrations and topographic factors.Significant values are in bold.*Significant correlation; **extremely significant correlation.

Table 6 .
Correlation coefficients between heavy metal element concentrations and etching information.

Table 7 .
Factor screening for modeling heavy metal element concentrations.

Table 8 .
Comparison of the accuracy of models for predicting soil heavy metal concentrations.Significant values are in bold.